Abstract. Given a pattern $P$ and a text $T$, both strings over a binary
alphabet, the binary jumbled string matching problem consists in telling
whether any permutation of $P$ occurs in $T$. The indexed version of this
problem, i.e., preprocessing a string to efficiently answer such permutation
queries, is hard and has been extensively studied in the last few years.
Currently the best bounds for this problem are $O(n^2/\log^2 n)$ (with $O(n)$
space and $O(1)$ query time) [7] and $O(r^2 \log r)$ (with $O(|L|)$ space and
$O(\log |L|)$ query time) [2], where $r$ is the length of the run-length
encoding of $T$ and $|L| = O(n)$ is the size of the index. In this paper we
present new results for this problem. Our first result is an alternative
construction of the index by Badkobeh et al. [2] that obtains a trade-off
between the space and the time complexity. It has $O(r^2 \log k + n/k)$
complexity to build the index, $O(\log k)$ query time, and uses
$O(n/k + |L|)$ space, where $k$ is a parameter. The second result is an
$O(n^2 \log^2 w / w)$ algorithm (with $O(n)$ space and $O(1)$ query time),
based on word-level parallelism, where $w$ is the word size in bits.



1 Introduction 

The umbrella term "approximate string matching" comprises a plethora of
matching models; one of them does not distinguish between permutations of the
pattern string. The task of jumbled string matching [3], in its decision
version, consists in telling if any permutation of a pattern $P$ occurs in a
text $T$, both strings over a finite alphabet. In the literature, the binary
version of this problem, i.e., such that the alphabet for $P$ and $T$ is
binary, has been given most attention, and our paper is also restricted to
this case. More formally, the binary jumbled string matching problem can be
stated as follows: We are given a text $T$ of length $n$ over the alphabet
$\Sigma = \{0, 1\}$, the length $m$ of a pattern over the same alphabet, and
the number $k$ of symbols 1 in the pattern. The task is to answer efficiently
if there exists a substring of $T$ of length $m$ containing exactly $k$ ones.
The pattern is usually represented as the pair $(m - k, k)$, called a Parikh
vector.

The online version of this problem can be trivially solved with an $O(n)$-time
algorithm for a single pattern. It is more interesting, however, to build an
index for $T$ making it possible to answer queries much faster, even in
constant time. The query pattern lengths are arbitrary (can be any values from
1 up to $n$). Each index-based algorithm can be described by a triple
$\langle (f(n), h(n)), g(n) \rangle$, where $f(n)$ is the preprocessing time,
$h(n)$ is the preprocessing space (which is usually also the size of the
resulting index), and $g(n)$ is the query time. We assume the word-RAM model
of computation.
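The trivial online solution mentioned above can be sketched as a single sliding-window pass. This is a minimal illustration only; the function name `occurs_online` and the representation of $T$ as a Python string of '0' and '1' characters are assumptions made for this sketch.

```python
def occurs_online(text: str, m: int, k: int) -> bool:
    """Return True if some length-m substring of the binary string `text`
    contains exactly k ones. One pass over the text, O(n) time."""
    n = len(text)
    if not 1 <= m <= n:
        return False
    ones = text[:m].count("1")          # ones in the first window
    if ones == k:
        return True
    for i in range(m, n):               # slide the window one symbol right
        ones += (text[i] == "1") - (text[i - m] == "1")
        if ones == k:
            return True
    return False
```

For example, `occurs_online("0110100", 3, 2)` is true because the window `011` contains two ones, while no window of length 3 in that text contains three ones.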

A fundamental observation concerning the binary jumbled string matching 
is an interval property: 

Lemma 1 ([3]) If, for a given text $T$ and a pattern length $m$, the answer is
positive for some $(m - k_1, k_1)$ and $(m - k_2, k_2)$, where $k_1 < k_2$, it
is positive also for all $(m - k, k)$ such that $k_1 < k < k_2$.

A practical consequence of this lemma is that it is enough to find the
minimum and the maximum number of ones for a given pattern length $m$ to be
able to give the answers for all $k$ for this $m$. To avoid complex notation,
we will call those values simply minOne and maxOne, respectively.
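To make the role of minOne and maxOne concrete, the following sketch computes both values for every length with a naive quadratic scan and then answers queries by a range check, which is exactly what the interval property permits. The names `one_ranges` and `occurs` are hypothetical, and the quadratic preprocessing is a baseline rather than one of the algorithms of this paper.

```python
def one_ranges(text: str) -> list[tuple[int, int]]:
    """ranges[m] = (minOne, maxOne) over all length-m windows of `text`.
    Naive O(n^2) preprocessing based on prefix sums of ones."""
    n = len(text)
    prefix = [0]
    for c in text:
        prefix.append(prefix[-1] + (c == "1"))
    ranges = [(0, 0)]                     # m = 0: the empty substring
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        ranges.append((min(counts), max(counts)))
    return ranges


def occurs(ranges: list[tuple[int, int]], m: int, k: int) -> bool:
    """Constant-time query: does the Parikh vector (m - k, k) occur?"""
    lo, hi = ranges[m]
    return lo <= k <= hi
```

The table uses $O(n)$ space and each query takes constant time, matching the index shape $\langle (f(n), O(n)), O(1) \rangle$ discussed in the text.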

Currently the best results for this problem are
$\langle (O(n^2/\log^2 n), O(n)), O(1) \rangle$ by Moosa and Rahman [7] and
$\langle (O(r^2 \log r), O(|L|)), O(\log |L|) \rangle$ by Badkobeh et al. [2],
where $r$ is the length of the run-length encoding of $T$ and $|L| = O(n)$ is
the size of their index structure (in practice, as shown in the cited work,
$|L|$ tends to be comparable with $r$).

Recently, an interesting result was presented by Cicalese et al. [4]. They
showed how to build an index for all Parikh vectors of a binary string in
$O(n^{1+\eta})$ time, which leaves a chance for false positives, i.e., may
report a Parikh vector not occurring in the string.

We present two results. First, we show that the index from [5] can be easily
modified to become an
$\langle (O(r^2 \log k + n/k), O(n/k + |L|)), O(\log k) \rangle$ solution,
where $k$ is a trade-off parameter. In particular, for $k = 1$ we obtain an
index with $\langle (O(r^2 + n), O(n)), O(1) \rangle$ complexity. While the
index space is increased, we think such a trade-off is often preferable. The
second result is an algorithm, based on word-level parallelism, with
$\langle (O(n^2 \log^2 w / w), O(n)), O(1) \rangle$ complexity, where $w$ is
the word size in bits.

Our algorithms, like all the others for this problem, are based on the interval 
property. We note that for each interval size it is enough to present a procedure 
only for finding maxOne, since minOne is equal to maxOne on negated input, 
i.e., where 0s become 1s and vice versa. Throughout the paper we assume that
all logarithms are in base 2. 

2 Basic notions and definitions 

Let $\Sigma$ denote a finite alphabet and $\Sigma^m$ the set of all possible
sequences of length $m$ over $\Sigma$. $|S|$ is the length of string $S$,
$S[i]$, $i \ge 0$, denotes its $(i + 1)$-th character, and $S[i \ldots j]$ its
substring between the $(i + 1)$-st and the $(j + 1)$-st characters
(inclusive). For a binary string $S$ over the alphabet $\{0, 1\}$ we denote
with $|S|_0$ and $|S|_1$ the number of 0's and 1's in $S$, respectively. The
Parikh vector of $S$ is the pair $(|S|_0, |S|_1)$. We say that a Parikh vector
$(x, y)$ occurs in a string $S$ if there exists a substring of $S$ such that
its Parikh vector is equal to $(x, y)$. The run-length encoding of a binary
string $S$ is the sequence $(a_1, b_1, a_2, b_2, \ldots, a_r, b_r)$ of
non-negative integers such that $a_i > 0$ for $i = 2, \ldots, r$, and
$b_i > 0$ for $i = 1, \ldots, r - 1$, and
$S = 0^{a_1} 1^{b_1} 0^{a_2} 1^{b_2} \cdots 0^{a_r} 1^{b_r}$. The maximal
substrings of a binary string containing only 0s (1s) are called 0-runs
(1-runs).
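The two notions just defined can be illustrated with a few lines of code. The sketch below assumes binary strings are given as Python strings of '0' and '1'; the helper names `parikh`, `run_length_encode` and `run_length_decode` are hypothetical.

```python
def parikh(s: str) -> tuple[int, int]:
    """Parikh vector (|S|_0, |S|_1) of a binary string."""
    return (s.count("0"), s.count("1"))


def run_length_encode(s: str) -> list[tuple[int, int]]:
    """Return [(a_1, b_1), ..., (a_r, b_r)] with s = 0^a_1 1^b_1 ... 0^a_r 1^b_r.
    Only a_1 and b_r may be zero."""
    pairs = []
    i, n = 0, len(s)
    while i < n:
        a = 0
        while i < n and s[i] == "0":    # length of the next 0-run
            a += 1
            i += 1
        b = 0
        while i < n and s[i] == "1":    # length of the next 1-run
            b += 1
            i += 1
        pairs.append((a, b))
    return pairs


def run_length_decode(pairs: list[tuple[int, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join("0" * a + "1" * b for a, b in pairs)
```

For the string `1100010` the encoding is `[(0, 2), (3, 1), (1, 0)]`, so $r = 3$, and its Parikh vector is $(4, 3)$.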

3 A variant of the Badkobeh et al. index 

Let $T$ be a binary string of length $n$ over $\Sigma = \{0, 1\}$ whose
run-length encoding has length $r$. In this section we present an alternative
method to build the Corner Index by Badkobeh et al. for $T$ that has
$\langle (O(r^2 \log k + n/k), O(n/k + |L|)), O(\log k) \rangle$ complexity,
where $k$ is a parameter. While our construction requires more space, it
obtains better preprocessing and query time. In particular, for $k = 1$ it
improves the preprocessing time by a logarithmic factor and yields constant
query time. We briefly recall how the Corner Index works. Let $G(i)$ and
$g(i)$ denote the minimum and maximum number of 1s in a substring of $T$
containing $i$ 0s, respectively. The following result holds:

Lemma 2 (cf. [2]) Given a Parikh vector $(x, y)$ and a binary string $T$ with
associated functions $G$ and $g$, $(x, y)$ occurs in $T$ iff
$G(x) \le y \le g(x)$.
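Lemma 2 can be checked experimentally on small inputs. The following brute-force sketch tabulates $G$ and $g$ in quadratic time and answers a query by the range test of the lemma. The function names are hypothetical, and counting the empty substring (so that $G(0) = 0$) is an assumption made here for definiteness.

```python
def min_max_ones_by_zeros(text: str) -> tuple[list[int], list[int]]:
    """Return (G, g): G[x] and g[x] are the minimum and maximum number of 1s
    over all substrings of `text` with exactly x 0s. Brute force, O(n^2)."""
    n, total_zeros = len(text), text.count("0")
    G = [n] * (total_zeros + 1)
    g = [0] * (total_zeros + 1)
    G[0] = 0                          # the empty substring (assumed to count)
    for i in range(n):
        zeros = ones = 0
        for j in range(i, n):         # substring text[i..j]
            if text[j] == "0":
                zeros += 1
            else:
                ones += 1
            G[zeros] = min(G[zeros], ones)
            g[zeros] = max(g[zeros], ones)
    return G, g


def occurs(G: list[int], g: list[int], x: int, y: int) -> bool:
    """Lemma 2: (x, y) occurs iff G(x) <= y <= g(x)."""
    return 0 <= x < len(G) and G[x] <= y <= g[x]
```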

Hence, to be able to know whether any Parikh vector occurs in $T$ it suffices
to compute the functions $G$ and $g$. Since $G$ and $g$ are monotonically
increasing, they can be encoded by storing only the points where they
increase. Let $L_G$ and $L_g$ be the sets of such points for $G$ and $g$,
respectively. The corner index of $T$ is the pair $(L_G, L_g)$. In accordance
with the notation from [2], let us use the symbol $L$ to denote the whole
Corner Index, and obviously we have $|L_G| + |L_g| = |L|$. From now on we
consider the construction of $L_G$ only, since the case of $L_g$ is analogous
and has the same space and time bounds. The set $L_G$ is defined as

$$L_G = \{ (i, G(i)) \mid G(i) < G(i + 1) \}.$$

The function $G$ can be reconstructed from $L_G$ based on the relation
$G(x) = G(r(x))$, where

$$r(x) = \min\{ i \mid i \ge x \wedge (i, G(i)) \in L_G \}.$$
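The encoding by corner points and the reconstruction $G(x) = G(r(x))$ can be sketched as follows. The names `corners` and `lookup` are hypothetical. Appending the last point of $G$ to the list is an assumption of this sketch, made so that every argument has a successor. A sorted list with binary search stands in for whatever search structure an actual index would use.

```python
from bisect import bisect_left


def corners(G: list[int]) -> list[tuple[int, int]]:
    """Points (i, G[i]) with G[i] < G[i + 1], in increasing order of i.
    The final point is appended as well (an assumption of this sketch)."""
    pts = [(i, G[i]) for i in range(len(G) - 1) if G[i] < G[i + 1]]
    pts.append((len(G) - 1, G[-1]))
    return pts


def lookup(pts: list[tuple[int, int]], x: int) -> int:
    """G(x) = G(r(x)), where r(x) is the smallest corner position >= x.
    Requires 0 <= x <= pts[-1][0]."""
    pos = bisect_left(pts, (x, -1))   # first corner (i, v) with i >= x
    return pts[pos][1]
```

For a non-decreasing table such as `G = [0, 0, 1, 1, 1, 3]`, only three points are stored and every value of `G` is recovered by one binary search.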

Let $\Pi_{rle}(T)$ be the set of Parikh vectors of all the substrings of $T$
beginning and ending with full runs of 1s. The authors of [2] showed that, in
order to build $L_G$, it is enough to compute $\Pi_{rle}(T)$, as
$L_G \subseteq \Pi_{rle}(T)$. Formally, the set $L_G$ corresponds to the set
of maximal elements of the partially ordered set
$(\Pi_{rle}(T), \triangleright)$, where the relation $\triangleright$ is
defined as

$$(x, y) \triangleright (x', y') \iff (x, y) \ne (x', y') \wedge x \ge x' \wedge y \le y'.$$



If $(x, y) \triangleright (x', y')$, $(x, y)$ is said to dominate $(x', y')$.
The set $\Pi_{rle}(T)$ can be computed efficiently on the run-length encoding
of $T$ in time $O(r^2)$. The total time complexity of the procedure to build
the Corner Index is $O(r^2 \log r)$, since for each such Parikh vector the
algorithm performs one lookup and at most one insertion and deletion in a
balanced tree data structure whose size is at most $r^2$. We now show how to
achieve a trade-off between the space and time complexity. We divide the
interval $[1, |T|_0]$ into sub-intervals of $k$ elements, such that length $i$
is mapped to the $\lfloor i/k \rfloor$-th interval, for
$i = 1, \ldots, |T|_0$. For each interval
$I_j = [(j - 1) \cdot k + 1, j \cdot k]$, for
$j = 1, \ldots, \lceil |T|_0 / k \rceil$, we maintain a balanced binary search
tree in which we store the set of maximal elements of
$(\Pi_{rle}(T), \triangleright)$ with first component in $I_j$. If this set is
empty, we store the element $((j - 1) \cdot k + 1, y)$, where $y$ is the
second component of the element in the previous interval with largest first
component. The value of $G$ on a given point can then be computed in time
$O(\log k)$ by doing a lookup in the tree of the corresponding interval. The
size of this index is $O(n/k + |L|)$, as we add at most one redundant pair per
(empty) interval.

We now describe how this index can be built in time $O(r^2 \log k + n/k)$.
The construction is divided into two similar steps. We use an array $V$ of
$\lceil n/k \rceil$ pointers, where each slot points to a binary search tree
(BST). During the first step, we insert each element $(x, y)$ from
$\Pi_{rle}(T)$ into the BST pointed to by $V[\lfloor y/k \rfloor]$ if it is
not dominated by any existing pair and, after an insertion, we also remove all
the existing elements dominated by it. In this phase we conceptually divide
the interval $[1, |T|_1]$ into sub-intervals (1-intervals) and partition the
pairs according to their second component. At the end of this step, in each
BST we have a superset of the set of maximal elements of
$(\Pi_{rle}(T), \triangleright)$ of the corresponding 1-interval. In
particular, all the pairs that are dominated by a pair that belongs to a
different 1-interval are not removed. The second step of the preprocessing
removes these elements and builds the final index. To this end, we use another
array $V'$ defined like $V$. We iterate over $V$ from left to right
maintaining in an integer $x_{max}$ the maximum first component of any pair
belonging to the BSTs already processed. For each
$j = 1, \ldots, \lceil |T|_1 / k \rceil$, we insert all the elements $(x, y)$
of the BST $V[j]$ such that $x > x_{max}$ into the BST
$V'[\lceil x/k \rceil]$.

It remains to prove that all the dominated pairs are removed using this
procedure. Let $(x, y)$ and $(x', y')$ be two elements of $\Pi_{rle}(T)$ such
that $(x, y) \triangleright (x', y')$. We distinguish two cases: if the two
elements map onto the same 1-interval, then only $(x, y)$ remains at the end
of the first pass; otherwise we have $y < y'$ and so we process $(x, y)$
before $(x', y')$ during the second pass. Hence, when we process $(x', y')$ it
must hold that $x_{max} \ge x$, so $(x', y')$ is skipped.

Clearly, by definition of the $\triangleright$ relation, the size of any tree
is bounded by $k$ in both phases. Hence, the total procedure can be performed
in time $O(r^2 \log k + n/k)$ on the run-length encoding of $T$. To handle
the case of trees that are empty, it suffices to maintain, during the second
pass and for each non-empty interval, the element with largest first component
in the corresponding tree, and iterate over the intervals from left to right
in time $O(n/k)$.
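The two-pass filtering described in this section can be sketched on an arbitrary set of integer pairs. This is an illustration under simplifying assumptions: plain Python lists replace the balanced BSTs (so the per-operation cost is linear in the bucket size rather than logarithmic), buckets are indexed from zero by integer division, and the padding of empty intervals is omitted. The names `dominates` and `maximal_two_pass` are hypothetical.

```python
def dominates(p: tuple[int, int], q: tuple[int, int]) -> bool:
    """p = (x, y) dominates q = (x', y'): p != q, x >= x' and y <= y'."""
    return p != q and p[0] >= q[0] and p[1] <= q[1]


def maximal_two_pass(pairs, k: int) -> dict[int, list[tuple[int, int]]]:
    """Maximal elements of `pairs` under domination, grouped by x // k."""
    # Pass 1: bucket by second component, keep only locally maximal pairs.
    buckets: dict[int, list[tuple[int, int]]] = {}
    for p in pairs:
        bucket = buckets.setdefault(p[1] // k, [])
        if p in bucket or any(dominates(q, p) for q in bucket):
            continue
        bucket[:] = [q for q in bucket if not dominates(p, q)]
        bucket.append(p)
    # Pass 2: sweep the buckets by increasing second component; x_max is the
    # largest first component seen in the buckets processed so far.
    index: dict[int, list[tuple[int, int]]] = {}
    x_max = -1
    for j in sorted(buckets):
        for x, y in buckets[j]:
            if x > x_max:             # not dominated from an earlier bucket
                index.setdefault(x // k, []).append((x, y))
        x_max = max(x_max, max(x for x, _ in buckets[j]))
    for group in index.values():
        group.sort()
    return index
```

Within one group the surviving pairs have pairwise distinct first components, so each group holds at most $k$ pairs, in line with the bound on the tree sizes stated above.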

Observe that, for $k = 1$, we obtain an index with
$\langle (O(r^2 + n), O(n)), O(1) \rangle$ complexity, as promised at the
beginning of this section; instead, for $k = \log n$, we get
$\langle (O(r^2 \log\log n + n/\log n), O(n/\log n + |L|)), O(\log\log n) \rangle$,
i.e., a smaller index with sublogarithmic query time. Note that for $k = n$ we
obtain the original Corner Index. It is also possible to replace the balanced
binary search trees with a more advanced data structure to achieve even faster
query time. In particular, the fusion tree [3] is a search tree that requires
$O(\log n / \log\log n)$ amortized time for all the operations. By using it,
we obtain an index with
$\langle (O(r^2 (\log k / \log\log k) + n/k), O(n/k + |L|)), O(\log k / \log\log k) \rangle$
complexity.

An interesting question concerns the size of the index $L$. In [2], apart from
experimental estimations for randomly built sequences, only an obvious bound
of $|L| = O(n)$ is given. Another trivial bound is $|L| = O(r^2)$. Here we
show that $|L|$ can be of size $\Theta(r^2)$ for $r = \Theta(n^{1/3})$. This
implies that the Corner Index can be of size $\Omega(n^{2/3})$. To see this,
we consider the sequence $A$

1, 2, 4, 5, 8, 10, 14, 21, 15, 16, 26, 25, 34, 22, 48, ...,

where each element $A(i)$ is the smallest positive integer such that the set
of all sums of consecutive elements up to and including $A(i)$ contains no
number more than once. This sequence is known in mathematics [119], but little
has been proven concerning its properties (cf. [9, Sec. 4]). Still, it is easy
to notice that for an element $A(i)$, $i \ge 1$, the number of sums concerning
only consecutive elements up to and including $A(i)$ is $i(i + 1)/2$, and the
sums have to be unique. The smallest allowed value for $A(i)$ is thus at most
$i(i + 1)/2 = O(i^2)$ (the sum corresponding to the single element $A(i)$ is
allowed). Hence, each $A(i)$, $i \ge 1$, is $O(i^2)$ and, from elementary
calculus, we obtain that the sum of the sequence $A(1), A(2), \ldots, A(r)$ is
$O(r^3)$. Coming back to our problem: if
$T = 0^{a_{A(1)}} 1^{b_{A(1)}} 0^{a_{A(2)}} 1^{b_{A(2)}} \cdots 0^{a_{A(r)}} 1^{b_{A(r)}}$,
then $|L| = \Theta(r^2)$ and $n = O(r^3)$, i.e., $|L| = \Omega(n^{2/3})$.
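The sequence $A$ can be generated greedily, directly from its definition. The sketch below is only an illustration of that definition; the function name `distinct_sum_sequence` is hypothetical.

```python
def distinct_sum_sequence(r: int) -> list[int]:
    """First r terms of the sequence in which each term is the smallest
    positive integer keeping all sums of consecutive elements distinct."""
    seq: list[int] = []
    seen: set[int] = set()      # all sums of consecutive elements so far
    suffix: list[int] = []      # sums of the runs ending at the last element
    while len(seq) < r:
        c = 1
        while True:
            # Sums created by appending c: c alone, and c added to every
            # run of consecutive elements that ends at the previous element.
            new = [c] + [s + c for s in suffix]
            if not any(v in seen for v in new):
                break
            c += 1
        seq.append(c)
        seen.update(new)
        suffix = new
    return seq
```

The first terms produced are 1, 2, 4, 5, 8, 10, 14, 21, 15, matching the beginning of the sequence listed above. After $r$ terms the set of sums has exactly $r(r + 1)/2$ distinct elements.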

4 An algorithm based on word-level parallelism 

In this section we present an algorithm based on word-level parallelism. Each
interval size is processed separately, and the (packed) text is scanned from
left to right in chunks of $w$ bits, where $w$ is the machine word size. The
text is however not packed very tightly; each symbol (0 or 1) is represented
with $f = 1 + \log w$ bits, and we will refer to such sequences of bits
encoding a symbol as fields. Let $k = \lfloor w/f \rfloor$ be the number of
symbols in a chunk. We denote with $C[i]$ the $i$-th symbol of chunk $C$.
Given a length $1 \le l \le n$, the idea is to slide a window of length $l$
over the packed text, i.e., spanning $\lceil l/k \rceil$ chunks, and compute
the number of ones in each alignment. We assume that $l \ge k$, since if $l$
is smaller than $k$, we can simply use a naive algorithm to compute maxOne for
$l$. Note that if $k$ does not divide $l$, the window spans only the prefix of
length $l \bmod k$ of the last chunk, which can be obtained with simple
bitwise operations. For simplicity, we assume that $k$ divides $l$. To begin
with, we compute the number of ones in the first alignment, i.e., in the first
$l/k$ chunks of the text. We then slide the window over the packed text by
extending the window by one chunk to the right and reducing it by one chunk
from the left, i.e., for a window beginning at position $i$ we perform a
transition from $T[i \ldots i + l - 1]$ to $T[i + k \ldots i + k + l - 1]$. At
each iteration of this process, we compute the maximum number of ones among
the $k$ new shifts of the window in time $O(\log w)$ and update maxOne if
needed. When the current window is moved to the right by one chunk (i.e., by
$k$ symbols), two text chunks are affected, the one including the symbols that
fall off the window and the one with the symbols that enter it. Let us denote
these chunks with $C_1$ and $C_2$, respectively. It is not hard to see that
the maximum number of ones among the new shifts is equal to

$$ones + \max_{1 \le h \le k} \left( \sum_{i=0}^{h-1} C_2[i] - \sum_{i=0}^{h-1} C_1[i] \right),$$

where $ones$ is the number of ones in the previous alignment (whose chunks
include $C_1$ but not $C_2$). Hence, to update maxOne, we should know if there
is at least one prefix of length $h$ such that the difference between the sums
of $C_2[0 \ldots h - 1]$ and $C_1[0 \ldots h - 1]$ is positive and, if the
answer is affirmative, what the maximum difference of sums over such
equal-length prefixes of $C_2$ and $C_1$ is. Note that we do not have to know
the prefix length that maximizes this difference. We now show how to compute
this value for two chunks of $k$ symbols in time $O(\log w)$ by reducing the
problem to the one of computing the prefix sums of a sequence of size $k$ of
$f$-bit numbers. We proceed as follows: first, we compute, in constant time, a
new word $C'$ such that

$$C'[i] = C_2[i] + 1 - C_1[i],$$

for $i = 0, \ldots, k - 1$. The word $C'$ holds the differences between the
fields in $C_1$ and $C_2$ with equal position augmented by one unit. Observe
that we defined $C'$ in this way so as to not let the fields obtain negative
values, i.e., all the values in $C'$ are non-negative. To this end, the
addition must also be performed before the subtraction.
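The computation of $C'$ can be illustrated by emulating a machine word with a Python integer. The concrete values below are assumptions chosen for this sketch: $w = 64$, hence $f = 7$ and $k = 9$, with field 0 stored in the lowest bits. The helper names `pack`, `unpack` and `difference_word` are hypothetical.

```python
W = 64                          # assumed word size in bits
F = 1 + (W.bit_length() - 1)    # field width f = 1 + log2(w) = 7
K = W // F                      # k = 9 symbols per chunk


def pack(symbols: list[int]) -> int:
    """Pack small non-negative integers into fields, field 0 lowest."""
    word = 0
    for i, s in enumerate(symbols):
        word |= s << (i * F)
    return word


def unpack(word: int, count: int = K) -> list[int]:
    """Extract the first `count` fields of a word."""
    return [(word >> (i * F)) & ((1 << F) - 1) for i in range(count)]


ONES = pack([1] * K)            # constant with the value 1 in every field


def difference_word(c1: int, c2: int) -> int:
    """C'[i] = C2[i] + 1 - C1[i] for all fields at once. The addition is
    done first, so no field becomes negative and no borrow crosses fields."""
    return (c2 + ONES) - c1
```

Each field of the result lies in $\{0, 1, 2\}$, and the whole word is obtained with one addition and one subtraction.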

Then, we compute the prefix sums of the $k$ symbols of $C'$ in time
$O(\log w)$ by adapting the algorithm by Hillis and Steele [B] to compute the
prefix sums of an array in parallel. The adaptation is straightforward: we
perform $\log w$ passes over $C'$ where the $i$-th pass computes the value

$$C''_i = \begin{cases} C''_{i-1} + (C''_{i-1} \ll (2^{i-1} \times f)) & \text{if } i > 0 \\ C' & \text{otherwise} \end{cases}$$
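The doubling passes can be sketched on an emulated word as follows. The layout is an assumption of this sketch ($w = 64$, $f = 7$, $k = 9$, field 0 in the lowest bits, so that moving values towards higher fields is a left shift), and a mask emulates the bits that would be shifted out of a real word. The names `pack`, `unpack` and `prefix_sums` are hypothetical.

```python
W = 64                          # assumed word size in bits
F = 1 + (W.bit_length() - 1)    # field width f = 7
K = W // F                      # k = 9 fields per word
MASK = (1 << (K * F)) - 1       # keeps only the k fields of the word


def pack(symbols: list[int]) -> int:
    word = 0
    for i, s in enumerate(symbols):
        word |= s << (i * F)
    return word


def unpack(word: int, count: int = K) -> list[int]:
    return [(word >> (i * F)) & ((1 << F) - 1) for i in range(count)]


def prefix_sums(word: int) -> int:
    """Inclusive prefix sums of all fields using about log2(k) shift-and-add
    passes: the pass with distance d adds field j - d to field j."""
    distance = 1
    while distance < K:
        word = (word + (word << (distance * F))) & MASK
        distance *= 2
    return word
```

With input fields in $\{0, 1, 2\}$ the largest prefix sum is at most $2k$, which fits into a field, so no carry spills into a neighbouring field.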

Observe that this step can be computed with a constant number of words of
space. Moreover, no overflow can occur since the largest sum has value $2k$.
In this way we almost obtain in the word $C''$ the prefix sums of the
differences. More precisely, for $h = 0, \ldots, k - 1$ we have

$$C''[h] = \sum_{i=0}^{h} (C_2[i] - C_1[i]) + h + 1,$$

i.e., the $h$-th difference is off by a value of $h + 1$ with respect to the
real value. To fix them, we observe that any value $C''[h]$ that is smaller
than $h + 1$ is not interesting, as the real difference is negative in this
case. In what follows, care must be taken not to let the fields obtain
negative values. We generate a word of increments $I$ with the following field
values: $1, 2, \ldots, k$. We find in parallel the maxima of the pairs of
corresponding fields from $C''$ and $I$, using the bit trick from [S] that
works in constant time. This idea uses the top bit of each field for its
purpose and this is why the fields have exactly $1 + \log w$ bits. As a
result, the new value of $C''$ has some fields taken from the "old" $C''$, and
some from $I$. Finally we subtract $I$ from $C''$; clearly, no field will
obtain a negative value, and zeros will appear in the fields corresponding to
prefix sum differences less than or equal to zero.
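A field-wise maximum that uses the spare top bit of each field can be sketched as below. The comparison shown is one common way to implement such a maximum; whether it coincides in detail with the trick of the cited work is an assumption. The word layout ($w = 64$, $f = 7$, $k = 9$, field 0 lowest) and all names are likewise assumptions of this sketch.

```python
W = 64                          # assumed word size in bits
F = 1 + (W.bit_length() - 1)    # field width f = 7
K = W // F                      # k = 9 fields per word


def pack(symbols: list[int]) -> int:
    word = 0
    for i, s in enumerate(symbols):
        word |= s << (i * F)
    return word


def unpack(word: int, count: int = K) -> list[int]:
    return [(word >> (i * F)) & ((1 << F) - 1) for i in range(count)]


HIGH = pack([1 << (F - 1)] * K)         # top bit of every field
INCR = pack(list(range(1, K + 1)))      # word of increments 1, 2, ..., k


def field_max(x: int, y: int) -> int:
    """Field-wise maximum of two words whose fields have a clear top bit."""
    ge = ((x | HIGH) - y) & HIGH        # top bit survives where x[i] >= y[i]
    mask = ge - (ge >> (F - 1))         # low f-1 bits set in those fields
    return (x & mask) | (y & ~mask)


def positive_differences(prefix: int) -> int:
    """Map offset prefix sums (true value plus h + 1 in field h) to
    max(0, true value) in every field."""
    return field_max(prefix, INCR) - INCR
```

Since every field of the maximum is at least the corresponding increment, the final subtraction never borrows across field boundaries.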

The final phase is to find quickly the maximum field value in the whole word.
Note that we can spend $O(\log w)$ time for this step without changing the
overall time complexity. We can obtain the desired time complexity with the
following algorithm: we logically divide $C''$ in two halves and initialize
two words with the first and last $k/2$ fields of $C''$, respectively. Then,
we compute in parallel the maxima between the two words. Note that the
resulting word logically contains $k/2$ fields only. We recursively repeat
this process up to $\log w$ passes. The last word will contain exactly one
field, the maximum difference.
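The halving procedure can be sketched as follows, again on an emulated word with the assumed layout $w = 64$, $f = 7$, $k = 9$ and field 0 in the lowest bits. Since $k$ need not be a power of two, the sketch rounds the lower half up; this detail, the field-wise maximum used, and all names are assumptions of the illustration.

```python
W = 64                          # assumed word size in bits
F = 1 + (W.bit_length() - 1)    # field width f = 7
K = W // F                      # k = 9 fields per word


def pack(symbols: list[int]) -> int:
    word = 0
    for i, s in enumerate(symbols):
        word |= s << (i * F)
    return word


HIGH = pack([1 << (F - 1)] * K)         # top bit of every field


def field_max(x: int, y: int) -> int:
    """Field-wise maximum of two words whose fields have a clear top bit."""
    ge = ((x | HIGH) - y) & HIGH
    mask = ge - (ge >> (F - 1))
    return (x & mask) | (y & ~mask)


def horizontal_max(word: int, fields: int = K) -> int:
    """Largest field value of `word`, found by repeatedly folding the upper
    half of the fields onto the lower half (about log2(k) passes)."""
    while fields > 1:
        half = (fields + 1) // 2
        low = word & ((1 << (half * F)) - 1)    # first `half` fields
        high = word >> (half * F)               # remaining fields
        word = field_max(low, high)
        fields = half
    return word
```

After the last pass only field 0 is non-zero, so the returned integer is the maximum itself.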

We spend $O(\log w)$ time to process $k$ symbols of the input data, hence the
speed-up over the naive algorithm is by factor $\Theta(w/\log^2 w)$, which
gives overall $O(n^2 \log^2 w / w)$ time complexity. This dominates over the
$O(n^2/\log^2 n)$ algorithm from [7] if, roughly speaking,
$w = \Omega(\log^{2+\epsilon} n)$, for any $\epsilon > 0$.
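The chunk-wise sliding scheme of this section can be cross-checked with a scalar reference sketch in which the inner loop plays the role of the $O(\log w)$ word operations. The sketch assumes that $k$ divides both the window length $l$ and the text length, and it takes `ones` to be the count of the whole previous alignment; the function name `max_one_chunked` is hypothetical.

```python
def max_one_chunked(bits: list[int], l: int, k: int) -> int:
    """maxOne for window length l over a list of 0/1 values, advancing the
    window by whole chunks of k symbols (k divides l and len(bits))."""
    n = len(bits)
    ones = sum(bits[:l])                  # ones in the first alignment
    best = ones
    start = 0
    while start + l + k <= n:
        c1 = bits[start:start + k]            # chunk leaving the window
        c2 = bits[start + l:start + l + k]    # chunk entering the window
        diff = gain = 0
        for h in range(k):                # prefix differences, lengths 1..k
            diff += c2[h] - c1[h]
            gain = max(gain, diff)        # largest positive difference
        best = max(best, ones + gain)
        ones += diff                      # alignment shifted by k symbols
        start += k
    return best
```

In the word-parallel algorithm the value `gain` is what remains in the single field after the final maximum computation.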